The data are MC generated (see below) to simulate the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. A Cherenkov gamma telescope observes high-energy gamma rays, taking advantage of the radiation emitted by charged particles produced inside the electromagnetic showers initiated by the gammas, which develop in the atmosphere. This Cherenkov radiation (of visible to UV wavelengths) leaks through the atmosphere and gets recorded in the detector, allowing reconstruction of the shower parameters. The available information consists of pulses left by the incoming Cherenkov photons on the photomultiplier tubes, arranged in a plane: the camera. Depending on the energy of the primary gamma, a total of a few hundred to some 10000 Cherenkov photons get collected, in patterns (called the shower image) that allow us to discriminate statistically those caused by primary gammas (signal) from the images of hadronic showers initiated by cosmic rays in the upper atmosphere (background).
Typically, the image of a shower after some pre-processing is an elongated cluster. Its long axis is oriented towards the camera center if the shower axis is parallel to the telescope's optical axis, i.e. if the telescope axis is directed towards a point source. A principal component analysis is performed in the camera plane, which results in a correlation axis and defines an ellipse. If the depositions were distributed as a bivariate Gaussian, this would be an equidensity ellipse. The characteristic parameters of this ellipse (often called Hillas parameters) are among the image parameters that can be used for discrimination. The energy depositions are typically asymmetric along the major axis, and this asymmetry can also be used in discrimination. There are, in addition, further discriminating characteristics, like the extent of the cluster in the image plane, or the total sum of depositions.
The data set was generated by a Monte Carlo program, Corsika, described in: D. Heck et al., CORSIKA, A Monte Carlo code to simulate extensive air showers, Forschungszentrum Karlsruhe FZKA 6019 (1998).
The program was run with parameters allowing the observation of events with energies down to below 50 GeV.
Attribute Information:
g = gamma (signal): 12332
h = hadron (background): 6688
We visualize, for each feature and class, the outliers. For a given empirical distribution, let $q_1$ be the first quartile and $q_3$ the third quartile (the 25th and 75th percentiles respectively). We represent with colors the points that fall outside the interval $\left[4q_1-3q_3,\,4q_3-3q_1\right]$, and in white the points that stay within that interval but fall outside $\left[q_1-1.5(q_3-q_1),\,q_3+1.5(q_3-q_1)\right]$ (notice that $q_3-q_1$ is the interquartile range, IQR).
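As a small sketch (assuming each feature is a 1-D NumPy array), both fences follow from the quartiles; note that $4q_1-3q_3=q_1-3(q_3-q_1)$, so the colored interval is just a 3·IQR fence:

```python
import numpy as np

def outlier_fences(x):
    """Return the inner (1.5*IQR) and outer (3*IQR) fences used to flag outliers."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    # Points outside `inner` but inside `outer` are drawn in white.
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    # [4*q1 - 3*q3, 4*q3 - 3*q1] simplifies to [q1 - 3*IQR, q3 + 3*IQR]:
    # points outside it are drawn in color.
    outer = (4 * q1 - 3 * q3, 4 * q3 - 3 * q1)
    return inner, outer
```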
If we were domain experts we would be able to decide whether to keep the outliers or not; since we are not, we decide to keep them. Indeed, those outliers could be legitimate but rare data (notice that there are no negative lengths and so on), and hiding them during training would result in a loss of important information and in a weaker classifier: it might perform better on usual samples, yet it would certainly be unable to classify the rarest cases.
Since we want to apply PCA later, and since this technique uses the covariance matrix (actually the correlation matrix, since we also standardize), we know it would perform poorly on nonlinear relationships. Indeed, a strong nonlinear relation can still result in a low correlation, so PCA wouldn't detect it. The following scatter plot matrix shows the relationships between the features, and after that we show their empirical distributions:
The heatmap is currently the following:
We immediately recognize many nonlinear relationships (especially involving fConc, fConc1, fAsym, fM3Long and fM3Trans). To bring out the linearity, we first center fAsym, fM3Long and fM3Trans and take their absolute values: even though they follow bimodal distributions, there is no apparent discrimination related to the sign, so the signed value is uninformative. Secondly, since fConc and fConc1 are ratios, we apply a logarithmic transformation to them, so that we get a difference between the logs of numerator and denominator rather than their ratio.
Notice that the denominator is fSize, so by taking $\log(fConc)$ we have $\log$(sum of two highest pixels)$-\log(fSize)$; we should then maybe apply a logarithmic transformation to fSize as well, but it seems that $fSize \sim \log(fLength)$ and $fSize \sim \log(fWidth)$, so if we applied the logarithm to fSize we would get $\log(fSize) \sim \log(\log(fLength))$ (same for fWidth), entangling everything more than necessary. Therefore we just apply the logarithm to fConc and fConc1 and leave everything else as it is.
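A minimal sketch of these transformations, assuming the features are stored as a dict of NumPy arrays keyed by the MAGIC column names:

```python
import numpy as np

def transform_features(data):
    """data: dict mapping feature name -> 1-D numpy array.
    Returns a new dict with the transformations discussed above applied."""
    out = {k: v.copy() for k, v in data.items()}
    # Bimodal, sign-symmetric features: center, then take the absolute value.
    for col in ("fAsym", "fM3Long", "fM3Trans"):
        out[col] = np.abs(out[col] - out[col].mean())
    # Ratios: the log turns the ratio into a difference of logs.
    for col in ("fConc", "fConc1"):
        out[col] = np.log(out[col])
    return out
```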
After transforming and standardizing, this is the much nicer result (even though not perfect):
Also notice how the correlations are much more evident now:
The objective here is taking data in a high dimensional space and mapping it into a lower dimensional space. The reason is that we want to reduce training and testing time and computational effort, and get rid of uninformative features that may be present. Also, in high dimensional spaces there is what's called the curse of dimensionality, by which samples are closer to the boundary of the sample space than to other samples, making distances lose significance. More exactly, the probability that, for instance, a point uniformly distributed in the unit $p$-dimensional ball belongs to the shell between the spheres of radius 0.9 and 1 is $$P(X\in S_{0.9}(p))=1-0.9^p\underset{p\rightarrow\infty}{\rightarrow}1$$
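A quick numerical check of this limit (a minimal sketch; the radius 0.9 and the dimensions tried below are just examples):

```python
def shell_probability(p, r=0.9):
    """P(a point uniform in the unit p-ball falls outside radius r) = 1 - r**p."""
    return 1 - r ** p

# The outer shell absorbs almost all of the volume as p grows.
for p in (2, 10, 50, 100):
    print(p, shell_probability(p))
```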
The dimensionality reduction techniques we see are linear, therefore they can be expressed as a map $\mathbf{x}\mapsto W\mathbf{x}$, where $W\in\mathbb{R}^{n,d}$ and $n<d$.
We want to perform a dimensionality reduction such that later we'll be able to approximately recover $\mathbf{x}$ from $\mathbf{y} = W\mathbf{x}$. Such a linear recovery is performed by doing $U\mathbf{y}=UW\mathbf{x}$, so we want to find the $W,U$ that solve $$\underset{W\in\mathbb{R}^{n,d},U\in\mathbb{R}^{d,n}}{argmin}\sum_{i=1}^m \|\mathbf{x}_i-UW\mathbf{x}_i\|^2$$
Given $A=X^\intercal X$ and $\mathbf{u_1},\dots,\mathbf{u_n}$ the $n$ leading eigenvectors of A, the solution is $U=[\mathbf{u_1},\dots,\mathbf{u_n}]$ and $W=U^\intercal$.
It's a common practice to center the examples before applying PCA, that is, letting $\mathbf{\mu}=\frac{1}{m}\sum_{i=1}^m \mathbf{x}_i$, we apply PCA on the vectors $(\mathbf{x_1-\mu}),\dots,(\mathbf{x_m-\mu})$. Actually, if the dimensions are on different measurement units, it's better to standardize the whole dataset. Indeed, since PCA is a variance maximization technique, dimensions whose scale is dominant would drive the principal directions regardless of how informative they are.
Now, being $r$ the rank of $A$, let $\sigma_1\ge\sigma_2\ge\dots\ge\sigma_r$ be the singular values of $X$ (the square roots of the eigenvalues of $A$) in descending order. The variance in the direction of the $k$-th principal component is given by the corresponding eigenvalue $\sigma_k^2$. The total variance is $\sum_{j=1}^n \sigma_j^2$, whereas the cumulative variance explained by the first $k$ principal components is $\sum_{j=1}^k \sigma_j^2$, so that the percentage of the explained variance after the reduction is $$\frac{\sum_{j=1}^k \sigma_j^2}{\sum_{j=1}^n \sigma_j^2}$$ We want this quantity to be as high as possible, but 0.85 is a good enough value.
Also notice that, if $X$ is centered, then $X^\intercal X$ is proportional to the covariance matrix (equal to it up to the factor $\frac{1}{m}$).
With the following graph, we see that keeping 7 principal components is more than enough, with a cumulative explained variance equal to 98.3% of the total one. We also show the resulting distributions.
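A sketch of the whole pipeline (standardize, fit PCA, inspect explained variance) with scikit-learn, on synthetic stand-in data since the real feature matrix isn't shown here:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                         # stand-in for the 10 features
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(500, 5))  # induce strong correlations

Xs = StandardScaler().fit_transform(X)   # standardize -> PCA works on correlations
pca = PCA(n_components=7).fit(Xs)
explained = pca.explained_variance_ratio_.sum()
X_reduced = pca.transform(Xs)            # samples mapped to the 7 leading components
```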
A second technique we could have used is random projections, used when we don't care about reconstruction but just about not distorting distances (i.e., we want that $\forall i,j, \|\mathbf{x_i-x_j}\| \approx \|W\mathbf{x_i}-W\mathbf{x_j}\|$).
Let's take the transformation $\mathbf{x}\mapsto W\mathbf{x}$, where $W$ is a random matrix, i.e. $W_{i,j}\sim N(0,\frac{1}{n})$.
Let $Q$ be a finite set of vectors in $\mathbb{R}^d$; let $\delta\in(0,1)$ and $n$ be an integer s.t. $$\epsilon=\sqrt{\frac{6\log(2|Q|/\delta)}{n}}\le 3$$
Then, we have $$P(\max_{x\in Q}|\frac{\|W\mathbf{x}\|^2}{\|\mathbf{x}\|^2}-1|\lt\epsilon)\ge 1-\delta$$
In other words, the maximum relative distortion of the squared norms is guaranteed not to exceed $\epsilon$ with probability at least $1-\delta$, provided we choose the dimension $n$ of the new space such that $\epsilon\le 3$. However, notice that, if we wanted for example $\epsilon=0.1$ and $\delta=0.1$, since $|Q|=19020$ for this dataset we would need $\frac{6\log(2*19020/0.1)}{0.1^2}=3348$ dimensions, a number far larger than the original one. This is the reason why we don't use this technique: it is better suited to either smaller datasets or cases with a huge number of original dimensions, which is not our case.
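A minimal sketch of the projection itself, on synthetic data (the original dimension below is made artificially high so that the target dimension 3348 computed above actually reduces it):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 10000, 3348, 50            # original dim, target dim, number of samples
X = rng.normal(size=(m, d))
W = rng.normal(scale=1 / np.sqrt(n), size=(n, d))  # W_ij ~ N(0, 1/n)
Y = X @ W.T                          # the projected samples

# Pairwise distances are approximately preserved.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
ratio = proj / orig
```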
We are about to test five different classifiers (SVM, FDA, Decision Tree, RF and Logistic Regression), each of which can have different hyperparameters. This means that not only do we have to choose the best classifier, but also the best hyperparameters for each classifier in order to compare it to the other ones. To do this, we use stratified k-CV for each model-hyperparameter combination using AUC as a score (actually, for each model, the different hyperparameters were tested on the same folds, rather than repeating the splitting each time the hyperparameter changes, so that the differences depend only on the hyperparameters and not on the splitting). Furthermore, undersampling was necessary due to the imbalance in the dataset (there are almost twice as many g (signal) events as h (background) events), caused by technical reasons. Finally, since SVM is distance-based, min-max rescaling was necessary before training.
First of all, the reason to use cross-validation rather than a simple training-test split is that, with the latter, the validation score is unreliable. Indeed, the score is highly dependent on the particular split that occurred, so that if we repeated the split we would get a (sometimes completely) different score. K-fold cross-validation overcomes this problem by splitting the dataset into k equally sized partitions. Then we iterate k times, using k-1 partitions for training and the remaining one for testing, thus calculating the error at each iteration. Finally, we average all the k errors.
Regarding testing error, given the error at the k-th fold $$Err_{k}=\sum\nolimits_{i \in C_{k}}\frac{I(y_{i}\neq\hat{y_{i}})}{n_{k}}$$ (that is, summing all the times the prediction is incorrect for all the elements in the k-th test set $C_{k}$, then dividing by its cardinality $n_{k}$), the average error is $$CV_{K}=\sum_{k=1}^K\frac{n_{k}}{n}Err_{k}$$
Stratified k-CV does the same thing with the only difference that the P/N percentage in the original dataset is unaltered after the split. The following is an example where males (M) are three times as frequent as females (F):
For computational reasons due to the large size of the dataset, $k=5$ is used.
In our dataset there are:
However, this evident imbalance is only due to technical reasons, and it would be a mistake to keep this proportion when training our classifiers, since they would be more likely to classify an event as a signal than background (whereas background events are actually a majority in the real data).
Due to the large size of the dataset, it would be useless and wrong to oversample the minority class. Hence, when cross-validating, at each split we randomly undersample the majority class on the training set only so that the two classes are equally distributed. We then train our classifier on the undersampled training set and test it on the non-undersampled test set. Not undersampling the latter is important because, otherwise, it would no longer reflect the real distribution and we would test our classifier on a different "reality".
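A sketch of this scheme, assuming plain NumPy arrays and a scikit-learn classifier (the helper names `undersample` and `cv_auc` are our own, and the data in the usage test is synthetic):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def undersample(X, y, rng):
    """Randomly drop majority-class samples until the classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([rng.choice(np.where(y == c)[0], size=n_min, replace=False)
                           for c in classes])
    return X[keep], y[keep]

def cv_auc(model, X, y, k=5, seed=0):
    """Stratified k-CV: undersample the training fold only,
    then score AUC on the untouched test fold."""
    rng = np.random.default_rng(seed)
    aucs = []
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    for tr, te in skf.split(X, y):
        Xtr, ytr = undersample(X[tr], y[tr], rng)
        model.fit(Xtr, ytr)
        aucs.append(roc_auc_score(y[te], model.predict_proba(X[te])[:, 1]))
    return float(np.mean(aucs))
```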
Since classifying a background event as signal is worse than the opposite, the classical accuracy score is not meaningful. Hence we use AUC instead (Area Under the ROC Curve). A ROC curve is an x-y plot where the x is the FPR (False-Positive Rate) and the y is the TPR (True-Positive Rate). We have that $$FPR = \frac{FP}{FP+TN}$$
$$TPR = \frac{TP}{TP+FN}$$Each (FPR,TPR) point is calculated at different probability thresholds at which we classify something as positive or negative (for instance, let's assume our classifier says "positive with probability 0.8". If the threshold is 0.7, then we consider it positive, whereas if it's 0.9 we consider it negative). The following is an example of some ROC curves:
The area under the curve (AUC) can be seen as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. Both FPR and TPR range from 0 to 1, so the AUC can be at most 1 (ideal case). If it's 0.5, it means that our classifier basically makes random choices. If it's less than 0.5, it's better to consider the opposite of its predictions.
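For concreteness, scikit-learn computes the (FPR, TPR) pairs and the AUC directly. With the toy labels and scores below, 3 of the 4 positive-negative pairs are ordered correctly, so by the ranking interpretation the AUC is 0.75:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1])            # toy ground truth
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # toy classifier scores

# One (FPR, TPR) point per threshold, from the most to the least conservative.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)         # 3 correctly ordered pairs out of 4
```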
When validating a classifier with stratified k-CV, we use AUC this way:
We repeat the same procedure for each classifier, so the best one for us is the one with highest mean AUC given its best hyperparameter.
A support vector machine finds a hyperplane (or a set of hyperplanes) that can be used for classification. A good separation is intuitively achieved by the hyperplane with the largest distance to the nearest training data points of any class. Given a hyperplane $L=\{\mathbf{v}:\langle\mathbf{w},\mathbf{v}\rangle+b=0\}$, the distance of a given $\mathbf{x}$ to $L$ is $$d(\mathbf{x},L)=\min\{\|\mathbf{x-v}\|:\mathbf{v} \in L\}$$ and, if $\|\mathbf{w}\|=1$, then $d(\mathbf{x},L)=|\langle\mathbf{w},\mathbf{x}\rangle+b|$.
Since a separating hyperplane is defined by $(\mathbf{w},b)$ s.t. $\forall{i}, y_i(\langle\mathbf{w,x_i}\rangle+b)>0$ (assuming $y_i$ can only be $+1$ or $-1$), its margin is the distance of the closest training examples to it, called support vectors. We say that a distribution $D$ is separable with a $(\gamma,\rho)$ margin if there exist $(\mathbf{w^*},b^*)$ s.t. $\|\mathbf{w^*}\|=1$ and $$D(\{(\mathbf{x},y):\|\mathbf{x}\|\le\rho \land y(\langle\mathbf{w^*,x}\rangle+b^*) \ge \gamma \})=1$$ We therefore seek the separating hyperplane with the largest margin, but this would mean assuming that the data are separable. If that's not the case (as is very common), the analysis performed is called soft-SVM and can be formulated as $$\underset{\mathbf{w,\xi},b}{argmin}(\lambda\|\mathbf{w}\|^2+\frac{1}{m}\sum_{i=1}^m \xi_i)$$ $$s.t.\ \forall i, y_i(\langle\mathbf{w,x_i}\rangle+b)\ge 1-\xi_i \land \xi_i\ge 0$$ where we basically introduced a regularization term to avoid overfitting, depending on an $l^2$ penalty.
Sklearn's SVM actually solves the so-formulated problem: $$\underset{\mathbf{w,\xi},b}{argmin}\frac{1}{2}\|\mathbf{w}\|^2+C\sum_{i=1}^n \xi_i$$ subject to $\forall i, y_i(\langle\mathbf{w,x_i}\rangle+b)\ge 1-\xi_i \land \xi_i\ge 0$
We are basically trying to maximize the margin (by minimizing $\|\mathbf{w}\|^2$), while incurring a penalty when a sample is misclassified or within the boundary. $C$ controls the strength of this penalty (the more we penalize, the more we overfit and the less we regularize), so it acts as an inverse regularization parameter (and it actually is our hyperparameter).
Notice that, by setting each $\xi_i=0$, the problem becomes hard-SVM (which requires the hypothesis of separability to be feasible).
Finally, since SVM is distance-based, we train and test on the scaled data using min-max scaling ($x' = \frac{x-\min}{\max-\min}$).
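A minimal sketch of scaling followed by a linear SVC, on synthetic two-class data (C=1.0 is just a placeholder for the value chosen by cross-validation):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

scaler = MinMaxScaler()          # x' = (x - min) / (max - min), per feature
Xs = scaler.fit_transform(X)
clf = SVC(kernel="linear", C=1.0).fit(Xs, y)
acc = clf.score(Xs, y)           # training accuracy, for illustration only
```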
The method we discussed so far finds a hyperplane described by a linear equation, which means that the boundary we get is linear. This may be good when data are linearly separable, but that's not always the case. Consider the following example on the left:
It's clear that there's no hyperplane that can effectively separate these two classes. However, if we project the data to a higher dimensional space (right), it turns out that they can actually be linearly separated. The insight we get is that by means of some transformation (which can go from a simple rotation to a more complex polynomial transformation that takes us to a higher dimensional space, possibly infinite) we can find the separating hyperplane on the resulting space (getting a linear boundary), and then go back to the original one (getting a possibly nonlinear boundary).
To do this, let's define the transformation $\psi:X\rightarrow F$, where $F$ is some feature space and it's a subset of a Hilbert space. In other words, $X$ is the original feature space (where each dimension is a predictor of the sample), $F$ is the new one and $\psi$ is the application that takes us from one to the other.
N.B. A Hilbert space is a vector space over $\mathbb{R}$ or $\mathbb{C}$ provided with an inner product, and complete with respect to the distance function induced by that inner product.
(A metric space is a space $F$ with a metric on it, that is a distance function $d:F \times F \rightarrow \left[0,+\infty\right[$ such that, $\forall x,y,z \in F$: $d(x,y)=0 \iff x=y$; $d(x,y)=d(y,x)$; and $d(x,z)\le d(x,y)+d(y,z)$ (the triangle inequality).
Now, a Cauchy sequence in $F$ is a sequence $\mathbf{x_1,x_2},\dots$ such that $\forall \epsilon \in \left]0,+\infty\right[$, there is a positive integer $N$ such that for all integers $m,n>N$ we have $d(\mathbf{x_m,x_n})\lt \epsilon$. A metric space $F$ such that every Cauchy sequence in $F$ converges in $F$ is said to be a complete metric space.)
Going back to SVM, we train a halfspace over $((\psi(\mathbf{x_1}),y_1),\dots,(\psi(\mathbf{x_m}),y_m))$. The choice of $\psi$ requires prior knowledge, but there are some generic mappings that are commonly used, for example polynomial ones. However, since it can be computationally expensive to do all the inner products in the resulting high dimensional space, we can use what are called kernels. A kernel function for a mapping $\psi$ is a function that implements inner product in the feature space, $$K(\mathbf{x,x'})=\langle\psi(\mathbf{x}),\psi(\mathbf{x'})\rangle$$ It is sometimes easy to calculate $K(\mathbf{x,x'})$ efficiently without even applying $\psi$.
The Representer Theorem
Consider any learning rule of the form $$\mathbf{w^*}=\underset{\mathbf{w}}{argmin}\left(f(\langle \mathbf{w},\psi(\mathbf{x_1})\rangle,\dots,\langle \mathbf{w},\psi(\mathbf{x_m})\rangle)+\lambda\|\mathbf{w}\|^2\right)$$ where $f:\mathbb{R}^m\rightarrow \mathbb{R}$ is an arbitrary function. Then $\exists \mathbf{\alpha}\in \mathbb{R}^m$ such that $\mathbf{w^*}=\sum_{i=1}^m \alpha_i\psi(\mathbf{x_i})$.
By this theorem, the optimal solution can be written as $\mathbf{w}=\sum_i \alpha_i\psi(\mathbf{x_i})$.
Denoting by $G$ the matrix such that $G_{i,j} = \langle\psi(\mathbf{x_i}),\psi(\mathbf{x_j})\rangle$, the previous problem becomes $$\underset{\alpha\in\mathbb{R}^m}{argmin}(f(G\mathbf{\alpha})+\lambda\mathbf{\alpha}^\intercal G\mathbf{\alpha})$$ $G$ is the Gram matrix and it only depends on inner products, so it can be calculated using $K$ alone. So, after we find $\mathbf{\alpha}$, then, given a new instance, $$\langle\mathbf{w},\psi(\mathbf{x})\rangle=\sum_j\alpha_j\langle\psi(\mathbf{x_j}),\psi(\mathbf{x})\rangle=\sum_j\alpha_j K(\mathbf{x_j,x})$$ so we can do prediction and training using $K$ only.
Finally, we can formulate soft-SVM this way: $$\min_{\alpha\in\mathbb{R}^m}\left(\lambda\mathbf{\alpha}^\intercal G\mathbf{\alpha}+\frac{1}{m}\sum_{i=1}^m\max\{0,1-y_i(G\mathbf{\alpha})_i\}\right)$$ and hard-SVM this way: $$\min_{\alpha\in\mathbb{R}^m}\mathbf{\alpha}^\intercal G\mathbf{\alpha}$$ $$s.t.\ \forall i,\ y_i(G\mathbf{\alpha})_i\ge 1$$
The most commonly used kernels are the polynomial kernel, defined as $$K(\mathbf{x,x'})=(1+\langle\mathbf{x,x'}\rangle)^k$$ used when we want to map to a space whose coordinates are (suitably scaled) monomials of degree at most $k$ in the original features $x_1,\dots,x_n$, and the Gaussian (RBF) kernel, especially useful to separate based on distances from a certain point (like in the two-circles case). More specifically, the Gaussian kernel is defined as $$K(\mathbf{x,x'})=\exp(-\frac{\|\mathbf{x-x'}\|^2}{2\sigma^2})$$
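We can verify the kernel identity for the degree-2 polynomial kernel in $\mathbb{R}^2$: the explicit feature map below (with the usual $\sqrt{2}$ scalings) reproduces $(1+\langle x,x'\rangle)^2$ exactly, without ever computing it:

```python
import numpy as np

def poly_kernel(x, z, k=2):
    """Polynomial kernel K(x, z) = (1 + <x, z>)^k."""
    return (1 + x @ z) ** k

def psi(x):
    """Explicit degree-2 feature map for x in R^2 such that
    <psi(x), psi(z)> = (1 + <x, z>)^2."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
# The kernel value equals the inner product in the 6-dimensional feature space.
assert np.isclose(poly_kernel(x, z), psi(x) @ psi(z))
```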
Observe that, by Mercer's conditions, a symmetric function $K:X\times X \rightarrow \mathbb{R}$ implements an inner product in some Hilbert space if and only if it is positive semidefinite. In our case, we wish our Gram matrix to be positive semidefinite.
The problem Sklearn's SVM solves is formulated this way: $$\underset{\mathbf{w,\xi},b}{argmin}\frac{1}{2}\|\mathbf{w}\|^2+C\sum_{i=1}^n \xi_i$$ $$s.t. \forall i, y_i(\langle\mathbf{w,\psi(x_i)}\rangle+b)\ge 1-\xi_i \land \xi_i\ge 0$$ which is a generalization of the linear case. Again, we tune the hyperparameter C.
In our two-class case, it's clear that the outcome of the prediction $Y$ given the $p$ predictors $X_{1},\dots,X_{p}$ (our features) is a Bernoulli variable (positive = success with probability $p$, negative = failure with probability $1-p$). It would then be wrong to use a linear regression model, since the linear combination $\sum \beta_{i}x_{i}$ should only be used for normally distributed responses and can assume any value in $\mathbb{R}$, whereas we are only interested in the interval $\left[0,1\right]$ (we certainly don't admit probabilities that are less than 0 or greater than 1).
A generalization of linear regression to non-normal responses is given by generalized linear models (GLMs). A GLM has three components:
In the case of the Bernoulli family, $Y_{i} \sim Bernoulli (p_{i})$, $\mu_{i}=p_{i}$ (according to the configuration of the predictors, $p_{i}$ varies and represents the probability that the sample belongs to the positive class). A common choice for the link function for the Bernoulli case is the logistic function, $$logit(p)=\log(\frac{p}{1-p})$$ The quantity $\frac{p}{1-p}$ is called odds and, as $p$ varies in $\left]0,1\right[$, it takes any value in $\left]0,+\infty\right[$. We take the log so that it covers the whole real line:
Now it makes sense to see it as a linear combination: $\eta_{i}=\sum\beta_{j}x_{ij}=logit(p_{i})$, so that, since $g(\mu_{i}) = \eta_{i}$, then $$p_{i} = \mu_{i} = g^{-1}(\eta_{i}) = \frac{e^{\sum\beta_{j}x_{ij}}}{1+e^{\sum\beta_{j}x_{ij}}}$$ ($j$ varies from $0$ to $p$, and we set $x_{i0} = 1$ for every $i$).
When training, we find the vector $\boldsymbol{\beta}$ that maximizes the likelihood $$\max\limits_{\boldsymbol{\beta}}l(\boldsymbol{\beta})=\prod_{i:y_{i}=1}p(x_{i}) \prod_{i:y_{i}=0}(1-p(x_{i}))$$ ($i$ spans the whole training set). In practice, a regularization term to avoid overfitting is added, so that, after defining the logistic loss $L(\boldsymbol{\beta}):=-\log l(\boldsymbol{\beta})$ (the negative log-likelihood), the problem becomes $$\min\limits_{\boldsymbol{\beta}}L(\boldsymbol{\beta})+\lambda\|{\boldsymbol{\beta}}\|_{2}^{2}$$ Once the $\beta_{j}$s are found, given a new test observation $(x_{1},\dots,x_{p})$ we can predict the probability that it belongs to the positive class just by calculating $$\hat{y}=\frac{e^{\sum_{j=0}^p \beta_{j}x_{j}}}{1+e^{\sum_{j=0}^p \beta_{j}x_{j}}}$$
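A sketch with scikit-learn's LogisticRegression on synthetic data (its C parameter is the inverse of $\lambda$); we also check that predict_proba is exactly the inverse logit of the fitted linear combination:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# C is the inverse regularization strength (l2 penalty by default).
clf = LogisticRegression(C=1.0).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# predict_proba is the inverse logit of eta = <beta, x> + beta_0.
eta = X @ clf.coef_.ravel() + clf.intercept_[0]
manual = 1 / (1 + np.exp(-eta))
```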
Following a more formal approach, GLMs can be generalized for $Y_{i}$ belonging to the exponential family, to which the Bernoulli turns out to belong: $$f(y_{i};\theta_{i},\phi) = \exp\{\frac{y_{i}\theta_{i}-b(\theta_{i})}{a_{i}(\phi)}+c(y_{i},\phi)\}$$ where $\phi$ is a nuisance (dispersion) parameter related to the variance of the distribution, $\theta_{i}$ is the parameter of interest and $a, b, c$ are known functions. The likelihood becomes $$L = \prod f(y_{i};\theta_{i},\phi) = \exp\{\sum\dots\}$$ and the log-likelihood is therefore $$\log(L) = \sum(\frac{y_{i}\theta_{i}-b(\theta_{i})}{a_{i}(\phi)}+c(y_{i},\phi))$$
For the Bernoulli family, we can rewrite $$f(y_{i};p_{i}) = p_{i}^{y_{i}}(1-p_{i})^{1-y_{i}} = \exp{\{y_{i}\log(\frac{p_{i}}{1-p_{i}})+\log(1-p_{i})\}}$$ and it's clear that $\theta_{i}$ is, in our case, $\log(\frac{p_{i}}{1-p_{i}})=logit(p_{i})$. The link function that we used above turns out to be the natural parameter for this family.
QDA is based on the assumption that each $x \in \mathbb{R}^n$ is the realization of an n-dimensional random vector $X$ whose distribution varies in two populations, say the positive (+) and the negative (-). If $X$ belongs to the + population, then $X \sim \mathcal{N}(\mu_+,\Sigma_+)$, otherwise $X \sim \mathcal{N}(\mu_-,\Sigma_-)$. Note the normality assumption.
When we have a new sample and we want to make a prediction, we could assign it to the positive population if $$\frac{f(x;\mu_+,\Sigma_+)}{f(x;\mu_-,\Sigma_-)}>t$$ where t is a threshold that depends on how much bigger the positive population is, on the loss due to wrong predictions etc. Using the densities and after some algebraic manipulation, we have $$\exp\{-\frac{1}{2}(x-\mu_+)^\intercal \Sigma_+^{-1}(x-\mu_+)+\frac{1}{2}(x-\mu_-)^\intercal \Sigma_-^{-1}(x-\mu_-)\} > t\frac{|\Sigma_+|^{\frac{1}{2}}}{|\Sigma_-|^{\frac{1}{2}}}$$ which is equivalent to $$(x-\mu_-)^\intercal \Sigma_-^{-1}(x-\mu_-)-(x-\mu_+)^\intercal \Sigma_+^{-1}(x-\mu_+)>2\log(t\frac{|\Sigma_+|^{\frac{1}{2}}}{|\Sigma_-|^{\frac{1}{2}}})=t^*$$ so we assign $x$ to the positive population if $$(x-\mu_-)^\intercal \Sigma_-^{-1}(x-\mu_-)-(x-\mu_+)^\intercal \Sigma_+^{-1}(x-\mu_+)>t^*$$ The separation surface we get is quadratic, and this is why this method is called Quadratic Discriminant Analysis (QDA). The quantities $(x-\mu_-)^\intercal \Sigma_-^{-1}(x-\mu_-)$ and $(x-\mu_+)^\intercal \Sigma_+^{-1}(x-\mu_+)$ are called the Mahalanobis distances of $x$ from $\mu_-$ and $\mu_+$ respectively. The Mahalanobis distance measures how close the points are, using the covariance of the features for normalization. QDA doesn't assume the covariance matrices are equal, so it performs well in the presence of heteroscedasticity (different covariances).
When the distributions of each class are assumed to have the same covariance matrix $\Sigma$ (homoscedasticity), we can see by substitution that we assign $x$ to the positive class if $$-2x^\intercal\Sigma^{-1}\mu_-+2x^\intercal\Sigma^{-1}\mu_+>t^*-\mu_-^\intercal\Sigma^{-1}\mu_-+\mu_+^\intercal\Sigma^{-1}\mu_+=t^{**}$$ that is $$x^\intercal\Sigma^{-1}(\mu_+-\mu_-)>\frac{t^{**}}{2}=t^{***}$$ We therefore get a linear boundary, and this is why the method is called Linear Discriminant Analysis (LDA).
The following image shows us how QDA and LDA perform under heteroscedasticity and homoscedasticity:
The quantities $\mu_+,\mu_-,\Sigma_+,\Sigma_-$ are estimated from the training set. We have no hyperparameter in QDA and LDA, so we just test them with one stratified k-CV. Note that our dataset is not Gaussian, so we do not expect high accuracy.
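A minimal sketch of both methods with scikit-learn, on synthetic heteroscedastic data (the means and covariances below are made up); the class means and covariances are estimated from the training data inside fit:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# Heteroscedastic classes: different spreads around different means.
X = np.vstack([rng.normal(0, 1.0, (200, 2)), rng.normal(2, 0.3, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class covariance estimates
lda = LinearDiscriminantAnalysis().fit(X, y)     # one pooled covariance estimate
```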
Classification trees are perhaps the simplest and easiest-to-interpret classifiers. Indeed, we segment the feature space into simple regions, and the set of splitting rules used during this segmentation can be summarized in a tree. In the following picture we have a graphical representation of such a segmentation, where $R_{i}, i=1,\ldots,5$ are different non-overlapping regions, and depending on the region we can predict the class:
Another example is the following, where we have 4 different features (sepal length, sepal width, petal length, petal width) and 3 different classes (setosa, versicolor, virginica). We can represent the decision tree by means of ${4 \choose 2} = 6$ 2D plots as follows:
When training a tree, we basically divide the set of possible values for $X_{1},\ldots,X_{p}$, where $p$ is the number of predictors, into $J$ disjoint regions $R_{1},\ldots,R_{J}$. Every time an observation falls into a certain region $R_{i}$, we always associate it with the same class. At each step of the building process, we follow a greedy approach, choosing the best split at that particular step. Splits are made by selecting a predictor $X_{j}$ and a cutpoint $s$ so that we get two regions, $\{X \mid X_{j}<s\}$ and $\{X \mid X_{j}\geq s\}$, leading to the greatest reduction in a certain error function. We then repeat the process with one of the two resulting regions and so on, recursively, until we reach, for instance, a maximum depth (chosen by performing k-CV).
An error function we can use when splitting is the Gini Index, defined as $$G=\sum_{k=1}^K \hat{p}_{mk}(1-\hat{p}_{mk})$$ at each branch, where $\hat{p}_{mk}$ represents the proportion of training observations in the $m$th branch that belong to the $k$th class. Given a node, its Gini Index is given by the weighted sum of the Gini Indexes of its branches. A small Gini Index indicates that the resulting partition is highly homogeneous (that is, observations belong predominantly to a single class), thus pure. An alternative is cross-entropy $$D = -\sum_{k=1}^K \hat{p}_{mk}\log{\hat{p}_{mk}}$$
There's also a weaker criterion, the classification error rate, $$E=1-\max_{k}(\hat{p}_{mk})$$ which is simply the fraction of the training observations in the branch that don't belong to the most common class. This criterion is not sufficiently sensitive for tree-growing, so we prefer the other two measures. Here is a graphical comparison of these three measures, in a two-class situation, where we can see the similarity between the Gini Index and cross-entropy:
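The three impurity measures can be written directly from their formulas (a small sketch; `p` is the vector of class proportions $\hat{p}_{mk}$ in a branch):

```python
import numpy as np

def gini(p):
    """Gini Index of a branch with class proportions p."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1 - p)))

def cross_entropy(p):
    """Cross-entropy of a branch (0*log(0) taken as 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def class_error(p):
    """Classification error rate: fraction not in the most common class."""
    return float(1 - np.max(p))

# A pure branch scores 0 on all three; a 50/50 branch is the worst case.
```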
Sklearn's Decision Tree Classifier uses the Gini Index by default. The hyperparameter we tune is the maximum depth of the tree.
Trees are not as accurate as other classifiers, so they are basically never used as they are. However, we can aggregate many decision trees, dramatically improving the predictive performance. What we get is a random forest, called this way due to the fact that the building process of its trees is randomized:
The hyperparameter we tune here is the number of trees $B$.
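A sketch with scikit-learn's RandomForestClassifier on synthetic data, where n_estimators plays the role of $B$ (500 matches the value reported in the results below, but here it's just an example):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (150, 4)), rng.normal(1.5, 1, (150, 4))])
y = np.array([0] * 150 + [1] * 150)

# B = n_estimators is the hyperparameter to tune via cross-validation.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
```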
Considering the testing results, it's evident that the best classifiers for us are Random Forest (with 500 trees we get a mean AUC equal to 0.912) and SVM with the RBF kernel (with C=10 we get a mean AUC equal to 0.913). QDA and LDA perform even more poorly than simple decision trees due to the non-Gaussianity of the dataset. The weakest classifier for this dataset is Logistic Regression.
LDA can also be seen as a dimensionality reduction technique. The objective here is finding a line so that the projections of the samples of the two different classes are well separated. It's very different from PCA; indeed, PCA is an unsupervised technique that maximizes variance regardless of the classes; LDA, instead, is a supervised technique that maximizes class separation (even though it can be seen as a dimensionality reduction technique as well, but for $n$ classes we can only get $n-1$ dimensions).
We have 2 classes and $d$-dimensional instances $x_1,\dots,x_n$, where $n_1$ instances come from the first class and $n_2$ from the second class.
The line we want to get is given by a unit vector $v$. Now, $v^\intercal x_i$ is the projection of $x_i$ onto a line in direction $v$. Since we want to measure separation between projections of two different classes, let $\tilde{\mu}_1$ and $\tilde{\mu}_2$ be the means of the projections of classes 1 and 2, and let $\mu_1$ and $\mu_2$ be the means of classes 1 and 2. We have that $$\tilde{\mu}_1 = \frac{1}{n_1} \sum_{x_{i}\in{C_1}} v^\intercal x_i = v^\intercal\left(\frac{1}{n_1}\sum_{x_{i}\in{C_1}}x_i\right) = v^\intercal \mu_1$$ The same goes for $\tilde{\mu}_2$, that is, the means of the projections are the projections of the means. To account for the variance of the classes, we shouldn't use $|\tilde{\mu}_1-\tilde{\mu}_2|$ alone as a measure of separation. Indeed, consider this example:
the previous measure would make us choose the horizontal direction because the means are farther apart, but the high variance would make that separation useless: the best direction is actually the vertical one, because the samples are much less spread. Therefore we normalize $|\tilde{\mu}_1-\tilde{\mu}_2|$ by a quantity called scatter. The scatter is simply the sample variance multiplied by the sample size. So, the projection scatters of classes 1 and 2 will be $$\tilde{s}_1^2=\sum_{x_{i}\in{C_1}} (v^\intercal x_{i}-\tilde{\mu}_1)^2 \qquad \tilde{s}_2^2=\sum_{x_{i}\in{C_2}} (v^\intercal x_{i}-\tilde{\mu}_2)^2$$
Since we need to normalize by both scatters, FDA projects on the line in the direction $v$ which maximizes $$J(v) = \frac{(\tilde{\mu}_1-\tilde{\mu}_2)^2}{\tilde{s}_1^2+\tilde{s}_2^2}$$
We now wish to express $J$ explicitly as a function of $v$. Let $S_1, S_2$ be the separate class scatter matrices for classes 1 and 2 (they measure the scatter of original samples before projection): $$S_1 = \sum_{x_{i}\in{C_1}}^{}(x_i-\mu_1)(x_i-\mu_1)^\intercal \qquad S_2 = \sum_{x_{i}\in{C_2}}^{}(x_i-\mu_2)(x_i-\mu_2)^\intercal$$ and let $S_W = S_1+S_2$ be the within-class scatter matrix. With some algebra we get $$\tilde{s}_1^2=v^\intercal S_1 v \qquad \tilde{s}_2^2=v^\intercal S_2 v$$ therefore $\tilde{s}_1^2+\tilde{s}_2^2=v^\intercal S_W v$.
Now, defining the between-class scatter matrix as $$S_B= (\mu_1-\mu_2)(\mu_1-\mu_2)^\intercal$$ ($S_B$ measures separation between the means of the two classes before projection) then the separation of the projected means is $$(\tilde{\mu}_1-\tilde{\mu}_2)^2 = v^\intercal S_B v$$ so that our original objective function can now be rewritten:
$$J(v) = \frac{(\tilde{\mu}_1-\tilde{\mu}_2)^2}{\tilde{s}_1^2+\tilde{s}_2^2} = \frac{v^\intercal S_B v}{v^\intercal S_W v}$$After some algebra we get that maximizing it is the same as solving $$S_B v = \lambda S_W v$$ (generalized eigenvalue problem). If $S_W$ has full rank, the solution is $$v=S_W^{-1}(\mu_1-\mu_2)$$
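The closed-form solution can be computed directly (a minimal sketch; the two synthetic classes below are made up, separated along the first axis so the recovered direction should be close to it):

```python
import numpy as np

def fda_direction(X1, X2):
    """Fisher discriminant direction v = S_W^{-1} (mu1 - mu2), normalized."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)   # within-class scatter of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)   # within-class scatter of class 2
    Sw = S1 + S2
    v = np.linalg.solve(Sw, mu1 - mu2)
    return v / np.linalg.norm(v)
```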
Based on the resulting one dimensional distribution of the projected instances, we can choose a separating hyperplane to use for classification.